DETAILED ACTION



Claim Objections 

1. Claim 27 is objected to because of the following informalities: "sequented" should be
changed to "subsequent" in line 6. Appropriate correction is required. 



Claim Rejections - 35 USC § 102 

2. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the
basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(e) the invention was described in (1) an application for patent, published under section 122(b), by another filed 
in the United States before the invention by the applicant for patent or (2) a patent granted on an application for 
patent by another filed in the United States before the invention by the applicant for patent, except that an 
international application filed under the treaty defined in section 351(a) shall have the effects for purposes of this
subsection of an application filed in the United States only if the international application designated the United
States and was published under Article 21(2) of such treaty in the English language.

3. Claims 47 and 48 are rejected under 35 U.S.C. 102(e) as being anticipated by Girod 
(6,483,532). 

Regarding claim 47, Girod discloses tracking face segments 356 in a video track (Figure 
3), detecting segments having mouth motion 358 (Figure 3), and estimating from the detected 
segments, those having talking mouth motion vs. non-talking mouth motion (Abstract; Col. 6, 
lines 49-67, Col. 7, lines 1-21). 

Regarding claim 48, Girod discloses segments estimated to have talking mouth motion 
are enabled for attaching speech to a speaker in a face segment (Col. 10, lines 25-33).

4. Claims 1-6, 17, 18, 22, 23, 27, 29, 30, 32-35, 37, 39-42, 49, 50, and 58-61 are rejected
under 35 U.S.C. 102(e) as being anticipated by Ariki et al. ("Face Indexing on Video Data - 
Extraction, Recognition, Tracking and Modeling"). 

Regarding claim 1, Ariki et al. ("Ariki") discloses processing a collection of images to 
extract therefrom features characteristic of a class of objects (Sect. 1-4) and grouping the images 
according to the extracted features helpful in identifying individual members of the class of 
objects (Sect. 1). 

Regarding claim 2, Ariki discloses the collection of images representing a sequence of 
video frames (Abstract). 

Regarding claim 3, Ariki discloses grouping of images in groups according to extracted 
features helpful in identifying individual members of a class produces a track of continuous 
frames for each individual member of the class of objects to be identified (Abstract; Sect. 1). 

Regarding claim 4, Ariki discloses the groups of images are stored in a database store for 
use in searching or browsing for individual members of a class (Sect. 1). 

Regarding claim 5, Ariki discloses the class of objects is human faces (Abstract). 

Regarding claim 6, Ariki discloses the collection of images representing a sequence of 
video frames and the grouping including forming face tracks of contiguous frames (Sect. 1), each 
track including time sections (Abstract), thereby including the starting and ending frames, and 
containing face regions (Sect. 1). 

Regarding claim 17, Ariki discloses the sequence of video frames is processed to include
annotations (e.g. name) associated with the face track (Abstract; Sect. 8).

Regarding claim 18, Ariki discloses the sequence of video frames is processed to include
face characteristic views associated with the face tracks (Sect. 4).

Regarding claim 22, Ariki discloses processing a sequence of video frames to extract 
therefrom facial features and grouping the video frames to produce face tracks of contiguous 
frames each face track being identified by the starting and ending frames in the track (Abstract; 
Sect. 1) and containing face characteristic data of an individual face (Sect. 1, Sect. 3-4). 

Regarding claim 23, Ariki discloses the face tracks are stored in a face index for use in 
searching or browsing for individual faces (Sect. 1). 

Regarding claim 27, Ariki discloses detecting predetermined facial features in a frame 
and utilizing the facial features for estimating a head boundary for the respective face (Sect. 1; 
Sect. 3-4), opening a tracking window based on the estimated head boundary, and utilizing the 
tracking window for tracking subsequent frames which include the predetermined facial feature 
(Sect. 6). 

Regarding claims 29 and 30, the arguments analogous to those presented above for 
claims 17 and 18 are applicable to claims 29 and 30, respectively. 

Regarding claim 32, the arguments analogous to those presented above for claims 1 and 
17 are applicable to claim 32. 

Regarding claims 33-35, the arguments analogous to those presented above for claims 1- 
3 are applicable to claims 33-35, respectively. 

Regarding claims 37 and 39, the arguments analogous to those presented above for 
claims 6 and 18 are applicable to claims 37 and 39, respectively.

Regarding claim 40, Ariki discloses annotating the names to each entry in the index
(Abstract; Sect. 8). 

Regarding claim 41, Ariki discloses the descriptions are applied manually during an 
editing operation (Sect 1, para. 3). 

Regarding claim 42, Ariki discloses the descriptions are applied automatically by means
of a stored dictionary (Sect. 6). 

Regarding claim 49, Ariki discloses processing a sequence of video frames to generate a 
face index having a plurality of entries, and attaching descriptions to entries in the face index 
(Abstract; Sect. 1). 

Regarding claim 50, Ariki discloses attaching descriptions from the face index to at least 
one video frame (Sect. 8). 

Regarding claim 58, Ariki discloses processing a collection of images to generate an 
index of a class of objects (Abstract), and searching the index for individual members of the 
class of objects (Abstract; Sect. 1). 

Regarding claim 59, Ariki discloses the collection of images is a sequence of video
frames (Abstract). 

Regarding claim 60, Ariki discloses processing the sequence of video frames producing a 
track of sequential frames for each individual member of the class of objects (Abstract; Sect. 1).

Regarding claim 61, Ariki discloses the class of objects is human faces (Abstract).



Claim Rejections - 35 USC § 103

5. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

6. Claims 43-46 are rejected under 35 U.S.C. 103(a) as being unpatentable over Wactlar et 
al. (5,835,667) in view of Girod (6,483,532). 

Regarding claim 43, Wactlar et al. ("Wactlar") discloses generating an index from a 
video track (Col. 11, lines 43-63, Col. 12, lines 1-40), generating a transcription of the audio
track (Col. 7, lines 39-67, Col. 8, lines 1-14), and aligning the transcription with the video index 
(Col. 8, lines 15-19). Wactlar discloses generating an index from a video track including image 
recognition (Col. 1, lines 7-11) and tracking objects (Col. 11, lines 43-67, Col. 12, lines 1-40) but
does not expressly disclose generating a face index. However, Girod discloses detecting and 
tracking a face from a video signal (Figure 3; Col. 6, lines 58-67, Col. 7, lines 1-21). Wactlar 
and Girod are combinable because they are from the same field of endeavor of video processing. 
At the time of the invention, it would have been obvious to a person of ordinary skill in the art to 
have modified the video indexing disclosed by Wactlar to specify generating a face index. The 
motivation for doing so would have been because it is a methodology routinely implemented in 
the art and allows for speaker identification. Therefore, it would have been obvious to one of 
ordinary skill in the art to have combined Wactlar with Girod to obtain the invention as specified 
in claim 43.

Regarding claim 44, Wactlar discloses the transcription is generated by closed-caption
decoding (Col. 7, lines 39-64). 

Regarding claim 45, Wactlar discloses the transcription is generated by speech 
recognition (Col. 7, lines 39-64). 

Regarding claim 46, Wactlar discloses identifying start and end points of speech 
segments in the transcription (Col. 8, lines 15-20), extracting from the video index start and end 
points of the segments (Col. 11, lines 43-67), and concurrent use of image and speech timing
information to increase the reliability (Col. 12, lines 41-51). Wactlar does not appear to 
recognize outputting all speech segments that have non-zero temporal intersection with the 
respective face segment. However, Girod discloses outputting all speech segments that have 
non-zero temporal intersection with the respective face segment (Col. 3, lines 31-50; Col. 6, lines 
49-67, Col. 7, lines 1-21; Col. 10, lines 25-33). At the time of the invention, it would have been 
obvious to a person of ordinary skill in the art to have modified the speech segments disclosed by 
Wactlar to include outputting all speech segments that have non-zero temporal intersection with 
the respective face segment. The motivation for doing so would have been because it increases 
the reliability of the system by processing the audio signal in response to the detected movement 
of a person in a video signal thereby permitting elimination of background noise from sites 
where the person is not speaking. Therefore, it would have been obvious to a person of ordinary 
skill in the art to have combined Wactlar with Girod to obtain the invention as specified in claim 
46. 
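
For illustration only, the limitation "outputting all speech segments that have non-zero temporal intersection with the respective face segment" can be read as an interval-overlap test. The sketch below assumes each segment is a (start, end) pair in seconds; the names Segment, overlaps, and speech_for_face are hypothetical and are not taken from Wactlar, Girod, or the application.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is (start, end) in seconds.
Segment = Tuple[float, float]


def overlaps(a: Segment, b: Segment) -> bool:
    """Return True when two segments share an interval of non-zero duration."""
    # The shared interval runs from the later start to the earlier end.
    return max(a[0], b[0]) < min(a[1], b[1])


def speech_for_face(face: Segment, speech: List[Segment]) -> List[Segment]:
    """Return every speech segment that overlaps the face segment."""
    return [s for s in speech if overlaps(face, s)]


# Assumed example data, not drawn from any cited reference.
face_segment = (10.0, 20.0)
speech_segments = [(5.0, 10.0), (9.5, 12.0), (15.0, 16.0), (19.0, 25.0), (20.0, 22.0)]
matched = speech_for_face(face_segment, speech_segments)
```

Under this reading, segments that merely touch the face segment at an endpoint, such as (5.0, 10.0) and (20.0, 22.0) here, have zero-length intersection and are not output.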

7. Claims 16, 31, and 36 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Ariki et al. ("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") 
as applied to claims 6, 22, and 35 above. 

Regarding claims 16 and 31, Ariki discloses grouping tracks containing similar faces 
(Abstract; Sect. 1) and recording time sections along with their names as indices (Abstract). 
While Ariki does not appear to expressly state merging tracks containing similar faces from a 
plurality of face tracks, it would have been obvious in light of his disclosure. 

Regarding claim 36, Ariki discloses including annotations (e.g. name) associated with the 
face track (Abstract; Sect. 8), but does not appear to recognize displaying a list. However, 
generating and displaying a list of data is well known and routinely utilized in the art. Therefore, 
it would have been obvious to one of ordinary skill in the art to have modified the annotations 
disclosed by Ariki to include displaying them as a list because it provides a visual aid to the user. 

8. Claims 51 and 52 are rejected under 35 U.S.C. 103(a) as being unpatentable over Ariki et al.
("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") and Nam et 
al. ("Speaker Identification and Video Analysis for Hierarchical Video Shot Classification"). 

Regarding claim 51, Ariki et al. ("Ariki") discloses extracting from the video track face 
segments representing human faces and producing a face track for each individual face 
(Abstract; Sect. 1). Ariki does not appear to disclose extracting audio segments. However, Nam 
et al. ("Nam") discloses extracting audio segments from the audio track (Abstract), fitting a 
model based on a set of audio segments corresponding to the individual of video (Sect. 2.2), and 
associating the model with the video track of the corresponding individual (Sect. 1, Sect. 2). 
Ariki and Nam are combinable because they are from the same field of endeavor of video 
indexing. At the time of the invention, it would have been obvious to one of ordinary skill in the
art to have modified the face indexing disclosed by Ariki to include associating audio 
information with the corresponding face track. The motivation for doing so would have been 
because both tracks provide important information to understand and organize the video data in 
an efficient manner. Therefore, it would have been obvious to one of ordinary skill in the art to 
combine Ariki with Nam to obtain the invention as specified in claim 51.

Regarding claim 52, the arguments analogous to those presented above for claim 51 are 
applicable to claim 52. Nam discloses the audio segments are speech segments (Sect. 1).

9. Claims 19-21 and 24-26 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Ariki et al. ("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") 
as applied to claims 6 and 22 above, and further in view of Podilchuk ("Face Recognition Using 
DCT-Based Feature Vectors"). 

Regarding claims 19-21 and 24-26, Ariki discloses face regions but does not specify 
including eye, nose, and mouth templates, image coordinates of geometric face features, or 
coefficients of the eigen-face representation. However, Podilchuk discloses that it is well known 
to use templates based on facial parts, geometric face features, and eigen-face representations 
(Sect. 1; Sect. 3). At the time of the invention, it would have been obvious to a person of 
ordinary skill in the art to include templates, geometric features, or eigen-face representation. 
Applicant has not disclosed that either one provides an advantage, is used for a particular 
purpose, or solves a stated problem. One of ordinary skill in the art, furthermore, would have 
expected Applicant's invention to perform equally well with either the face region disclosed by 
Ariki or using templates, geometric features, or eigen-face representation because they perform 
the same function of face detection. Therefore, it would have been obvious to one of ordinary
skill in the art to combine Ariki with Podilchuk to obtain the invention as specified in claims 19-21
and 24-26.

10. Claims 7-10, 28, and 38 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Ariki et al. ("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") 
as applied to claims 6, 22, and 37 above, and further in view of Wactlar et al. (5,835,667). 

Regarding claims 7, 28, and 38, Ariki does not appear to recognize including audio data. 
Wactlar discloses associating audio data with video data (Abstract). Ariki and Wactlar are 
combinable because they are from the same field of endeavor of video indexing. At the time of 
the invention, it would have been obvious to one of ordinary skill in the art to have modified the 
face indexing disclosed by Ariki to include associating audio information with the corresponding 
face track. The motivation for doing so would have been because both tracks provide important 
information to understand and organize the video data in an efficient manner. Therefore, it 
would have been obvious to one of ordinary skill in the art to combine Ariki with Wactlar to 
obtain the invention as specified in claims 7, 28, and 38. 

Regarding claim 8, Ariki discloses generating a face index from the sequence of video 
frames (Abstract). The arguments analogous to those presented above for claim 7 are applicable 
to claim 8. Wactlar discloses generating an index from a video track (Col. 11, lines 43-63, Col. 
12, lines 1-40), generating a transcription of the audio track (Col. 7, lines 39-67, Col. 8, lines 1- 
14), and aligning the transcription with the video index (Col. 8, lines 15-19).

Regarding claims 9 and 10, the arguments analogous to those presented above for claim 7
are applicable to claims 9 and 10. Wactlar discloses the transcription of the audio data is
generated by closed-caption decoding and speech recognition (Col. 7, lines 39-64).

11. Claims 11-13 are rejected under 35 U.S.C. 103(a) as being unpatentable over Ariki et al.
("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") and Wactlar 
(5,835,667) as applied to claims 7 and 8 above, and further in view of Girod (6,483,532). 

Regarding claim 11, Wactlar discloses identifying start and end points of speech
segments in the transcription (Col. 8, lines 15-20), extracting from the video index start and end 
points of the segments (Col. 11, lines 43-67), and concurrent use of image and speech timing 
information to increase the reliability (Col. 12, lines 41-51). Wactlar does not appear to 
recognize outputting all speech segments that have non-zero temporal intersection with the 
respective face segment. However, Girod discloses outputting all speech segments that have 
non-zero temporal intersection with the respective face segment (Col. 3, lines 31-50; Col. 6, lines 
49-67, Col. 7, lines 1-21; Col. 10, lines 25-33). At the time of the invention, it would have been 
obvious to a person of ordinary skill in the art to have modified the speech segments disclosed by 
Ariki and Wactlar to include outputting all speech segments that have non-zero temporal 
intersection with the respective face segment. The motivation for doing so would have been 
because it increases the reliability of the system by processing the audio signal in response to the 
detected movement of a person in a video signal thereby permitting elimination of background 
noise from sites where the person is not speaking. Therefore, it would have been obvious to a 
person of ordinary skill in the art to have combined Ariki and Wactlar with Girod to obtain the 
invention as specified in claim 11.

Regarding claim 12, Ariki discloses tracking face regions (Abstract) but neither Ariki nor
Wactlar appears to recognize detecting mouth motion. However, Girod discloses tracking face
segments 356 in a video track (Figure 3), detecting segments having mouth motion 358 (Figure 
3), and estimating from the detected segments, those having talking mouth motion vs. non- 
talking mouth motion (Abstract; Col. 6, lines 49-67, Col. 7, lines 1-21). At the time of the 
invention, it would have been obvious to one of ordinary skill in the art to have modified the face 
detection disclosed by Ariki and Wactlar to include detecting mouth motion. The motivation for 
doing so would have been because it is well known and routinely utilized in the art of face 
detection and increases the versatility of the system by determining if the person is talking. 
Therefore, it would have been obvious to a person of ordinary skill in the art to combine Ariki 
and Wactlar with Girod to obtain the invention as specified in claim 12. 

Regarding claim 13, the arguments analogous to those presented above for claim 12 are 
applicable to claim 13. Girod discloses segments estimated to have talking mouth motion are 
enabled for attaching speech to a speaker in a face segment (Col. 10, lines 25-33).

12. Claims 14 and 15 are rejected under 35 U.S.C. 103(a) as being unpatentable over Ariki et
al. ("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") and 
Wactlar (5,835,667) as applied to claim 7 above, and further in view of Nam et al. ("Speaker 
Identification and Video Analysis for Hierarchical Video Shot Classification"). 

Regarding claim 14, the arguments analogous to those presented above for claim 7 are 
applicable to claim 14. Wactlar does not appear to recognize fitting a model based on the 
extracted audio data. However, Nam et al. ("Nam") discloses extracting audio segments from 
the audio track (Abstract), fitting a model based on a set of audio segments corresponding to the 
individual of video (Sect. 2.2), and associating the model with the video track of the
corresponding individual (Sect. 1, Sect. 2). At the time of the invention, it would have been 
obvious to one of ordinary skill in the art to have modified the face indexing disclosed by Ariki 
and Wactlar to include associating audio information with the corresponding face track. The 
motivation for doing so would have been because both tracks provide important information to 
understand and organize the video data in an efficient manner and further enhance the searching 
capability of the system. Therefore, it would have been obvious to one of ordinary skill in the art 
to combine Ariki and Wactlar with Nam to obtain the invention as specified in claim 14. 

Regarding claim 15, the arguments analogous to those presented above for claim 14 are 
applicable to claim 15. Nam discloses the audio segments are speech segments (Sect. 1).

13. Claims 53, 54, and 57 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Ariki et al. ("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") 
in view of Belongie et al. ("Color- Texture-Based Image Segmentation Using EM and Its 
Application to Content-Based Image Retrieval"). 

Regarding claim 53, Ariki discloses a processor for processing a collection of images to 
extract therefrom features characteristic of objects, and for outputting therefrom indexing data 
with respect to the features (Abstract; Sect. 1-2) and storing indexing data outputted from the 
processor in groups according to the features selected for extraction (Sect. 6; Sect. 8). Ariki does 
not appear to disclose including a user interface for selecting the features to be extracted to 
enable searching for and identifying individual members of the class of objects. However, 
Belongie et al. ("Belongie") discloses including a user interface for selecting features to be 
extracted to enable searching for and identifying individual members (Sect. 1). At the time of 
the invention, it would have been obvious to one of ordinary skill in the art to have modified the
feature extraction disclosed by Ariki to include a user interface. The motivation for doing so 
would have been because it is well known in the art and it facilitates classification or retrieval of 
the individual members. 

Regarding claim 54, Ariki discloses a browser-searcher for browsing and searching the 
store of indexing data to locate particular members of the class of objects (Sect. 1). 

Regarding claim 57, the arguments analogous to those presented above for claim 53 are 
applicable to claim 57. Ariki discloses facial feature extraction to enable searching for 
individual human faces (Abstract). 

14. Claims 55 and 56 are rejected under 35 U.S.C. 103(a) as being unpatentable over Ariki et 
al. ("Face Indexing on Video Data - Extraction, Recognition, Tracking and Modeling") in view 
of Belongie et al. ("Color- Texture-Based Image Segmentation Using EM and Its Application to 
Content-Based Image Retrieval") as applied to claim 54 above, and further in view of King et al. 
(5,600,775). 

Regarding claims 55 and 56, Ariki and Belongie do not appear to disclose including an 
editor for scanning the indexing store and for correcting errors occurring therein. However, King 
et al. ("King") discloses including an editor for scanning the indexing data store and correcting 
errors and annotating the indexing data store with annotations (Abstract; Col. 2, lines 3-67, Col. 
3, lines 1-35; Figure 5). At the time of the invention, it would have been obvious to one of 
ordinary skill in the art to have modified the indexing disclosed by Ariki and Belongie to include 
an editor. The motivation for doing so would have been because it is a well known graphical 
user interface and enhances the system's searching and indexing capabilities. Therefore, it would
have been obvious to one of ordinary skill in the art to combine Ariki and Belongie with King to
obtain the invention as specified in claims 55 and 56. 